fix(action_status_bridge): retry deferred fault reports instead of dropping them by bburda · Pull Request #472 · selfpatch/ros2_medkit

bburda · 2026-06-23T13:36:24Z

Pull Request

Summary

Two fixes (see commits):

1. action_status_bridge: don't lose / promptly deliver faults when the FaultManager service isn't discovered yet (closes #471)

The bridge silently lost the first fault of a failing action when the FaultManager's report_fault service was not yet discovered at the instant the bridge observed the terminal status (startup discovery window; exposed by DDS timing - failing on Humble, passing on Jazzy).

Root cause: on the healthy -> failed transition apply_message() called the fire-and-forget reporter->report() and then committed last_reported_state_ = kFailed unconditionally. If the service was not ready, FaultReporter dropped the request - but the bridge had already recorded the action as reported. The latched (transient_local) status is delivered to a subscription exactly once and a rescan skips already-subscribed topics, so the dropped report was never retried and the high-severity fault was lost permanently.

Fix:

Track the latest observed action state (desired_state_) separately from what the FaultManager was told (last_reported_state_); deliver via reconcile() only when they differ and the report is deliverable (service ready; null reporter = unit-test seam), committing only after delivery.
Deliver as fast as the discovery floor allows: the FaultManager captures its freeze-frame snapshot when it receives the report, so delivery latency makes that snapshot stale. The happy path reports synchronously in the status callback; a deferred report is retried on a dedicated fast retry timer (retry_period_sec, default 50 ms, armed only while pending), decoupled from the slow discovery rescan, so it lands within one short tick of the service appearing. The executor is deliberately not blocked. The README documents this delivery-latency model and its level-triggered boundaries.
reconcile() holds state_mutex_ across gate -> report -> commit so the status callback and retry timer can't double-report under a multi-threaded executor.
The prune-vanished heal is gated on service readiness and warns (instead of silently dropping) when a vanished action's heal can't be delivered.
The shared FaultReporter fire-and-forget contract is unchanged.

2. gateway: fix flaky SSE handler test under ASan

SSEFaultHandlerTest.StreamSnapshotsEntityContextAtEnqueue flaked under AddressSanitizer. enqueue_event() spun for a fixed budget after publishing; under sanitizer slowdown the handler's on_fault_event (which snapshots the owning entity) could run after that budget, so the test cleared the cache before the snapshot was taken and the replayed event omitted x-medkit. Added SSEFaultHandler::events_received() (a monotonic counter, like connected_clients()) and made enqueue_event() wait until the handler has actually consumed the event. Test-only timing fix; no product behavior change.

Issue

closes [BUG] action_status_bridge loses first fault when FaultManager service not yet discovered (startup race) #471

Type

Bug fix
New feature or tests
Breaking change
Documentation only

Testing

action_status_bridge: 31 unit tests pass (incl. deferred raise/heal delivered once the service is ready, and the retry pass not dropping while the service is down); the test_integration launch test passes (raise + heal); linters + clang-tidy clean. CI Pixi (humble) (originally failing) and Humble/Jazzy/Lyrical/TSan green on the prior commits.
gateway SSE: full test_sse_fault_handler (19 tests) passes; the snapshot test passes 30/30 in a repeat loop; clang-format + clang-tidy clean on the changed files.

Checklist

Breaking changes are clearly described (and announced in docs / changelog if needed)
Tests were added or updated if needed
Docs were updated if behavior or public API changed

…opping them When a failing action's terminal status reaches the bridge before the FaultManager report service is discovered (e.g. a latched transient_local status delivered during the startup discovery window), the fire-and-forget report() is dropped. The bridge previously committed last_reported_state_ unconditionally, so the dropped high-severity fault was never retried: the latched status is delivered to a subscription only once, and a rescan skips already-subscribed topics, so no second message ever re-drives the report. Track the latest observed state (desired) separately from what the FaultManager was actually told (reported), and commit the transition only once the report is deliverable (a null reporter is the unit-test seam; a real reporter must have a discovered service). reconcile_pending() re-attempts undelivered transitions on the existing rescan tick, so a report dropped during the discovery window is delivered as soon as the service appears. The "Action ... -> fault" log now fires only on actual delivery rather than after a silent drop. Closes #471

Copilot

Pull request overview

Fixes a startup discovery race in ros2_medkit_action_status_bridge where a terminal (latched) ABORTED/CANCELED action status could be observed before the FaultManager report_fault service was discovered, causing the bridge to “commit” the failure locally even though FaultReporter dropped the report—permanently losing the fault.

Changes:

Split “latest observed” per-action state (desired_state_) from “successfully delivered to FaultManager” state (last_reported_state_), and only commit after delivery via a new reconcile() path.
Add reconcile_pending() on the existing rescan tick to retry deferred raises/heals once the service becomes ready.
Add a regression unit test ensuring deferred faults remain pending (not committed/dropped) when the FaultManager service is unavailable; update README to document the behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File	Description
`src/ros2_medkit_action_status_bridge/src/action_status_bridge_node.cpp`	Implements desired-vs-reported reconciliation and rescan-driven retries for deferred fault delivery.
`src/ros2_medkit_action_status_bridge/include/ros2_medkit_action_status_bridge/action_status_bridge_node.hpp`	Documents the new reconcile model and adds `DesiredState` / `desired_state_` tracking.
`src/ros2_medkit_action_status_bridge/test/test_action_status_bridge.cpp`	Adds a regression test covering “service unavailable => fault stays pending” behavior.
`src/ros2_medkit_action_status_bridge/README.md`	Documents deferred delivery + retry semantics during startup discovery.

SSEFaultHandlerTest.StreamSnapshotsEntityContextAtEnqueue flaked under AddressSanitizer: enqueue_event() spun for a fixed budget (20 x 10 ms) after publishing a FaultEvent, but under sanitizer slowdown on_fault_event - which snapshots the owning entity - could run after that budget. The test then cleared the cache before the snapshot was taken, so the replayed event omitted x-medkit and the assertion failed. Add SSEFaultHandler::events_received() (a monotonic processed-event counter, mirroring connected_clients()) and make enqueue_event() spin until the handler has actually consumed the event before returning. The entity snapshot is then always taken while the App is still in the cache, independent of execution speed. No product behavior change; entity resolution stays synchronous in on_fault_event.

…t retry timer A fault deferred because the FaultManager service was not yet discovered was only retried on the slow rescan tick (default 2s). The FaultManager captures the freeze-frame snapshot when it receives the report, so that latency makes the snapshot stale - a late fault is reported against context that has already moved on. Deliver as soon as the report channel is available instead. - Add a dedicated fast retry timer (retry_period_sec, default 50 ms), decoupled from the discovery rescan and armed only while a delivery is pending, so a deferred report lands within one short tick of the service appearing. The happy path (service discovered) still reports synchronously in the status callback. - Hold state_mutex_ across the reconcile gate, report() and commit so the status callback and the retry timer cannot double-report under a multi-threaded executor; report()/report_passed() are async so the section stays short. - Gate the prune-vanished heal on service readiness and warn when a vanished action's heal cannot be delivered, rather than dropping it silently. - Tests: deferred raise and heal are delivered once the service is ready, and the retry pass keeps retrying without dropping while the service is down. - Document the delivery-latency model and its level-triggered boundaries in the README; add retry_period_sec to config and the parameter table.

Copilot AI review requested due to automatic review settings June 23, 2026 13:36

Copilot started reviewing on behalf of bburda June 23, 2026 13:36 View session

Copilot AI reviewed Jun 23, 2026

View reviewed changes

bburda self-assigned this Jun 23, 2026

mfaferek93 reviewed Jun 23, 2026

View reviewed changes

mfaferek93 approved these changes Jun 23, 2026

View reviewed changes

bburda merged commit 243a2f4 into main Jun 23, 2026
18 of 20 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(action_status_bridge): retry deferred fault reports instead of dropping them#472

fix(action_status_bridge): retry deferred fault reports instead of dropping them#472
bburda merged 3 commits into
mainfrom
fix/action-status-bridge-deferred-fault

bburda commented Jun 23, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

bburda commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request

Summary

1. action_status_bridge: don't lose / promptly deliver faults when the FaultManager service isn't discovered yet (closes #471)

2. gateway: fix flaky SSE handler test under ASan

Issue

Type

Testing

Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bburda commented Jun 23, 2026 •

edited

Loading